Obtaining Measure Concentration from Markov Contraction 
Abstract

Concentration bounds for non-product, non-Haar measures are fairly recent: the first such 
result was obtained for contracting Markov chains by Marton in 1996. Since then, several other 
such results have been proved; with few exceptions, these rely on coupling techniques. Though 
coupling is of unquestionable utility as a theoretical tool, it appears to have some limitations. 
Coupling has yet to be used to obtain bounds for more general Markov-type processes: hidden 
(or partially observed) Markov chains, Markov trees, etc. As an alternative to coupling, we 
apply the elementary Markov contraction lemma to obtain simple, useful, and apparently novel 
concentration results for the various Markov-type processes. Our technique consists of expressing 
probabilities as matrix products and applying Markov contraction to these expressions; thus it 
is fairly general and holds the potential to yield numerous results in this vein. 

1 Introduction 

1.1 Background 

In 1996 Marton [20] published a concentration inequality for contracting Markov chains, apparently
the first such result for a non-product, non-Haar measure. In the decade that followed, Marton and
others continued to distill and expand a key insight: analogues of the Azuma-Hoeffding-McDiarmid
inequality [2, 10, 25] for independent random variables may be obtained for dependent ones, provided
a strong mixing condition holds.

To recall, the aforementioned inequality implies that if $\mu$ is a product distribution on $\Omega^n$ and
$f : \Omega^n \to \mathbb{R}$ satisfies $\|f\|_{\mathrm{Lip}} \le n^{-1}$ under the Hamming metric, we have

$$\mu\{|f - \mu f| > t\} \le 2\exp(-2nt^2). \tag{1}$$



In [20], Marton pioneered the transportation method for proving concentration inequalities. This
technique is in principle applicable to arbitrary nonproduct measures, and when applied to Markov
chains $\mu$ with contraction coefficient $\theta < 1$, it yields

$$\mu\{|f - Mf| > t\} \le 2\exp\left[-2n\left(t(1-\theta) - \sqrt{\frac{\log 2}{2n}}\right)^2\right], \tag{2}$$

where $Mf$ is a $\mu$-median of $f$. Since product distributions are degenerate cases of Markov chains
(with $\theta = 0$), Marton's result is a powerful generalization of (1).

The Markov contractivity condition $\theta < 1$ implies strong mixing, and in a series of papers
[21, 22, 23], Marton gave other concentration results for dependent variables under various metrics
and types of mixing. In particular, Theorem 2 of [21] gives a generic mixing condition which implies
a transportation inequality and therefore concentration.

Further progress in obtaining concentration from mixing was made, among others, in [5, 6, 
15, 16, 28, 29]. Using Stein's method for exchangeable pairs, Chatterjee [5] obtained an elegant 
concentration inequality in terms of a Dobrushin-Shlosman type contractivity condition. Samson 
[29] was apparently the first to use explicit mixing coefficients in a concentration result. Since these 
are central to this paper we define them without further delay; the (standard) notation is clarified 
in Section 1.3. 

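Let $\mu$ be the joint distribution of $(X_1, \ldots, X_n)$, $X_i \in \Omega$. For $1 \le i < j \le n$ and $x \in \Omega^i$, we
denote by

$$\mathcal{L}((X_j, \ldots, X_n) \mid (X_1, \ldots, X_i) = x)$$

the distribution of $(X_j, \ldots, X_n)$ conditioned on $(X_1, \ldots, X_i) = x$. For $y \in \Omega^{i-1}$ and $w, w' \in \Omega$,
define

$$\eta_{ij}(y, w, w') = \left\| \mathcal{L}((X_j, \ldots, X_n) \mid (X_1, \ldots, X_i) = yw) - \mathcal{L}((X_j, \ldots, X_n) \mid (X_1, \ldots, X_i) = yw') \right\|_{TV},$$

and

$$\bar\eta_{ij} = \sup_{y \in \Omega^{i-1},\; w, w' \in \Omega} \eta_{ij}(y, w, w'). \tag{3}$$

The coefficients $\bar\eta_{ij}$, termed $\eta$-mixing coefficients¹ in [15], play a key role in several recent
concentration results. Define $\Gamma$ and $\Delta$ to be upper-triangular $n \times n$ matrices, with $\Gamma_{ii} = \Delta_{ii} = 1$
and

$$\Gamma_{ij} = \sqrt{\bar\eta_{ij}}, \qquad \Delta_{ij} = \bar\eta_{ij}$$

for $1 \le i < j \le n$.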

In 2000, Samson [29] proved that any distribution $\mu$ on $[0,1]^n$ and any convex $f : [0,1]^n \to \mathbb{R}$
with $\|f\|_{\mathrm{Lip}} \le 1$ (with respect to $\ell_2$) satisfy

$$\mu\{|f - \mu f| > t\} \le 2\exp\left(-\frac{t^2}{2\|\Gamma\|_2^2}\right) \tag{4}$$

where $\|\Gamma\|_2$ is the $\ell_2$ operator norm.

In 2007, almost synchronously and using different techniques, Chazottes et al. [6] and the
author with K. Ramanan [15] showed that any distribution $\mu$ on $\Omega^n$ and any $f : \Omega^n \to \mathbb{R}$ with
$\|f\|_{\mathrm{Lip}} \le n^{-1/2}$ (with respect to the Hamming metric) satisfy

$$\mu\{|f - \mu f| > t\} \le 2\exp\left(-\frac{t^2}{2\|\Delta\|_\infty^2}\right) \tag{5}$$

where $\|\Delta\|_\infty$ is the $\ell_\infty$ operator norm ($\|\Delta\|_\infty$ may be replaced by $\|\Delta\|_2$, and [6] achieves a better
constant in the exponent).

¹That choice of terminology is perhaps suboptimal in light of the unrelated notion of $\eta$-weak dependence of Doukhan
et al. [8], but the sufficiently distinct contexts should prevent confusion.

The results (4) and (5) are not readily comparable as they hold in different spaces for different
metrics with different normalization, and the former requires convexity. They share the feature of
establishing concentration for a wide class of measures, in terms of the natural mixing coefficients
$\bar\eta_{ij}$. Indeed, since

$$\|\Delta\|_\infty = \max_{1 \le i < n} \left(1 + \bar\eta_{i,i+1} + \bar\eta_{i,i+2} + \cdots + \bar\eta_{i,n}\right)$$

and by the Gersgorin disc theorem [11]

$$\|\Gamma\|_2^2 = \lambda_{\max}(\Gamma^{T}\Gamma) \le \max_{1 \le i \le n} \sum_{j=1}^{n} (\Gamma^{T}\Gamma)_{ij},$$

suitable upper estimates on $\bar\eta_{ij}$ provide bounds for $\|\Gamma\|_2$ and $\|\Delta\|_\infty$.

Aside from the straightforward observation (due to Samson) that the $\eta$-mixing coefficients are
bounded by the $\varphi$-mixing ones (see [4]), we are only aware of a few cases where simple, readily
computed estimates on $\bar\eta_{ij}$ are given. In particular, Samson [29] controls $\bar\eta_{ij}$ by the contraction
coefficients of a Markov chain, and Chazottes et al. [6] give some estimates on $\bar\eta_{ij}$ for various
temperature regimes of Gibbs random fields. The estimates quoted above are obtained via the
coupling method, which, while powerful, often requires some ingenuity to construct the requisite
joint distribution, even in the simple case of a Markov chain [20, 29]. In some cases, the coupling
may even elude explicit construction [6].

As the random processes of interest become more complex, it becomes progressively more difficult
to obtain estimates on mixing coefficients via coupling. We are particularly interested in
examining the $\eta$-mixing of several Markov-type processes, motivated by statistical and computer
science applications. Hidden Markov Models (HMMs) have been used in natural language processing
[18, 27] and signal processing [24] for decades, with considerable success. Concentration
bounds for Markov chains (and more generally, HMMs) have implications in machine learning and
empirical process theory [9, 13]. A Markov-type process called the Markov marginal process (MMP)
in [14] underlies adaptive Markov Chain Monte Carlo simulations [1]; these evolve according to an
inhomogeneous Markov kernel, which in addition to time also depends on the path history. In a
forthcoming work, A. Brockwell and the author give strong laws of large numbers for MMPs in
terms of the $\eta$-mixing coefficients. Random processes indexed by trees have been attracting the
attention of probability theorists for some time [3, 26], and the principal technical contribution of
this paper is a bound on $\bar\eta_{ij}$ for these types of processes.

Our results do not invoke the coupling method but rather rely on the Markov contraction lemma 
(Lemma 2.1). The technique provides novel concentration bounds for the processes listed above - 
results which the coupling method has yet to yield or reproduce. 

Remark 1.1. On some level, the distinction between our technique and the coupling method is
semantic. From conversations with experts it appears that what we call here "Markov contraction" is
commonly referred to as "coupling". The novelty of our method lies in (i) avoiding any constructions
(implicit or explicit) of joint distributions, (ii) rewriting complicated sums as simple(r) matrix and
tensor products, and (iii) applying Lemma 2.1 to the latter expressions. Thus it seems that our method
is sufficiently different from classical coupling techniques, both in execution and results obtained,
to merit the terminological distinction.



1.2 Main results 

In this paper we present estimates on the $\eta$-mixing coefficients $\bar\eta_{ij}$ defined in (3), for the various
Markov-type processes mentioned above. These bounds immediately imply concentration inequalities
for a wide class of metrics and measures, via (4) and (5).

The precise statements of the results require preliminary definitions and are postponed until
later sections. The main technical contribution of this paper is Theorem 4.1, which bounds
$\eta$-mixing coefficients for Markov tree processes, yielding what appears to be the first concentration
of measure result for these. However, we give equal priority to the goal of presenting Markov
contraction as a versatile new method for bounding $\bar\eta_{ij}$. The nature of the bounds is to control $\bar\eta_{ij}$,
a global function of the distribution $\mu$, by some local, easily computed contraction coefficients
of $\mu$. For example, let $\mu$ be an inhomogeneous Markov chain defined by the transition kernels
$\{p_i : 0 \le i < n\}$, which induces a density on $\Omega^n$ by

$$\mu(x) = p_0(x_1) \prod_{i=1}^{n-1} p_i(x_{i+1} \mid x_i), \qquad x \in \Omega^n.$$



Define the $i$-th contraction coefficient:

$$\theta_i = \sup_{y, y' \in \Omega} \left\| p_i(\cdot \mid y) - p_i(\cdot \mid y') \right\|_{TV}, \qquad 1 \le i < n. \tag{6}$$

This quantity turns out to control the $\eta$-mixing coefficients for $\mu$:

$$\bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1},$$

a fact which is proved in [29] using coupling. In [15] we gave an (arguably simpler) alternative
proof, which paves the way for the several new results presented here.
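As a rough illustration of how (6) feeds into (5), the following sketch computes the contraction coefficients of a finite-state chain and fills in the upper-triangular matrix of bounds $\theta_i \cdots \theta_{j-1}$ on $\bar\eta_{ij}$. The function names and the NumPy representation (kernels stored as column-stochastic arrays `P[x, y] = p(x | y)`) are assumptions made for this example, not notation from the paper.

```python
import numpy as np

def tv(p, q):
    """Total variation distance, with the factor 1/2 used in the text."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def contraction_coefficient(P):
    """theta = max over pairs of columns of the TV distance; P[x, y] = p(x | y)."""
    m = P.shape[1]
    return max(tv(P[:, y], P[:, yp]) for y in range(m) for yp in range(m))

def eta_upper_bounds(kernels):
    """Upper-triangular matrix whose (i, j) entry (0-based) is the product of
    the contraction coefficients of kernels[i], ..., kernels[j-1]."""
    n = len(kernels) + 1
    thetas = [contraction_coefficient(P) for P in kernels]
    delta = np.eye(n)
    for i in range(n):
        prod = 1.0
        for j in range(i + 1, n):
            prod *= thetas[j - 1]
            delta[i, j] = prod
    return delta

# Hypothetical two-state kernel, reused at every step (a homogeneous chain).
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
delta_bound = eta_upper_bounds([P, P, P])
# Maximum row sum, i.e. an upper estimate of the operator norm appearing in (5).
norm_bound = np.linalg.norm(delta_bound, np.inf)
```

Since the entries of `delta_bound` dominate those of $\Delta$ entrywise, `norm_bound` is an upper estimate on $\|\Delta\|_\infty$ and can be substituted into (5).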

This paper is organized as follows. In Section 1.3 we summarise some basic notation used
throughout the paper. Some auxiliary lemmas are given in Section 2. The remaining three sections
deal with bounding $\bar\eta_{ij}$ for Markov chains, Markov tree processes, and Markov marginal processes,
respectively.

1.3 Notation and definitions 

Since the contribution of this paper is not measure-theoretic in nature, we henceforth take $\Omega$ to be
a finite set. Extensions to the countable case are quite straightforward [15] and the continuous case,
under mild assumptions, is not much more difficult [13, 14].

We use the terms measure, density and distribution interchangeably; all measures are probabilities
unless noted otherwise. If $\mu$ is a measure on $\Omega^n$ and $f : \Omega^n \to \mathbb{R}$, we use the standard
notation

$$\mu f = \int f \, d\mu$$

and write

$$\mu\{|f - \mu f| > t\}$$

as a shorthand for

$$\mu(\{x \in \Omega^n : |f(x) - \mu f| > t\}).$$

The (unnormalized) Hamming metric on $\Omega^n$ is defined by

$$d(x, y) = \sum_{i=1}^{n} \mathbb{1}_{\{x_i \ne y_i\}}, \qquad x, y \in \Omega^n, \tag{7}$$

where the indicator variable $\mathbb{1}_{\{\cdot\}}$ assigns 0-1 truth values to the predicate in $\{\cdot\}$.
The Lipschitz constant of a function, with respect to some metric $d$, is defined by

$$\|f\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)}.$$



Random variables are capitalized ($X$), specified sequences are written in lowercase ($x \in \Omega^n$),
the shorthand $X_i^j = (X_i, \ldots, X_j)$ is used for all sequences, and sequence concatenation is denoted
multiplicatively: $x_i^j x_{j+1}^k = x_i^k$. Sums will range over the entire space of the summation variable;
thus $\sum_{x_i^j} f(x_i^j)$ stands for

$$\sum_{x_i^j \in \Omega^{j-i+1}} f(x_i^j).$$

By convention, when $i > j$, we define

$$\sum_{x_i^j} f(x_i^j) \equiv f(\varepsilon)$$

where $\varepsilon$ is the null sequence. Products of spaces and measures are denoted by $\otimes$.

The total variation norm of a signed measure $\nu$ on $\Omega^n$ (i.e., vector $\nu \in \mathbb{R}^{\Omega^n}$) is defined by

$$\|\nu\|_{TV} = \tfrac{1}{2}\|\nu\|_1 = \tfrac{1}{2} \sum_{x \in \Omega^n} |\nu(x)|$$

(the factor of 1/2 is not entirely standard). For readability, we will drop the subscript TV from the
norm; thus everywhere in the sequel, $\|\cdot\|$ will mean $\|\cdot\|_{TV}$.

A signed measure $\nu$ on a set $\mathcal{X}$ is called balanced if $\nu(\mathcal{X}) = 0$. Departing from standard
convention, our stochastic matrices will be column- (as opposed to row-) stochastic.

2 Contraction and tensorization 

Our method for bounding $\eta$-mixing coefficients rests on the following simple result:

Lemma 2.1. Let $P : \mathbb{R}^\Omega \to \mathbb{R}^\Omega$ be a Markov operator:

$$(P\nu)(x) = \sum_{y \in \Omega} P(x \mid y)\,\nu(y),$$

where $P(x \mid y) \ge 0$ and $\sum_{x \in \Omega} P(x \mid y) = 1$. Define the contraction coefficient of $P$ as above:

$$\theta = \max_{y, y' \in \Omega} \left\| P(\cdot \mid y) - P(\cdot \mid y') \right\|.$$

Then

$$\|P\nu\| \le \theta \|\nu\|$$

for any balanced signed measure $\nu$ on $\Omega$ (i.e., $\nu \in \mathbb{R}^\Omega$ with $\sum_{x \in \Omega} \nu(x) = 0$).

This result is sometimes credited to Dobrushin [7]; the quantity $\theta$ has been referred to in the
literature, alternatively, as the Doeblin contraction or Dobrushin ergodicity coefficient. However,
the observation apparently goes as far back as Markov himself [19] (see [15] for a proof), so it seems
appropriate to refer to the result above as the Markov contraction lemma.
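The statement of Lemma 2.1 is easy to probe numerically. The sketch below draws a random column-stochastic matrix and random balanced vectors and checks the inequality $\|P\nu\| \le \theta\|\nu\|$ on each draw; it is only a sanity check under assumed helper names (`tv_norm`, `check_contraction`, and so on), not a proof and not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv_norm(nu):
    """Total variation norm with the factor 1/2 used in the text."""
    return 0.5 * np.abs(nu).sum()

def contraction_coefficient(P):
    """Maximum TV distance between two columns of P, where P[x, y] = P(x | y)."""
    m = P.shape[1]
    return max(tv_norm(P[:, y] - P[:, yp]) for y in range(m) for yp in range(m))

def random_column_stochastic(m, rng):
    """Random m x m matrix with nonnegative entries and unit column sums."""
    P = rng.random((m, m))
    return P / P.sum(axis=0, keepdims=True)

def check_contraction(m=5, trials=200, tol=1e-12):
    """Return True if ||P nu|| <= theta ||nu|| held on every random balanced nu."""
    P = random_column_stochastic(m, rng)
    theta = contraction_coefficient(P)
    for _ in range(trials):
        nu = rng.standard_normal(m)
        nu -= nu.mean()  # subtracting the mean makes the entries sum to zero
        if tv_norm(P @ nu) > theta * tv_norm(nu) + tol:
            return False
    return True
```

The balancing step matters: for an unbalanced $\nu$ (for instance a probability vector) the inequality generally fails, since $P$ preserves total mass.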

Another important property of the total variation norm is that it tensorizes, in the following 
way: 

Lemma 2.2. Consider two finite sets $\mathcal{X}, \mathcal{Y}$, with probability measures $p, p'$ on $\mathcal{X}$ and $q, q'$ on $\mathcal{Y}$.
Then

$$\|p \otimes q - p' \otimes q'\| \le \|p - p'\| + \|q - q'\| - \|p - p'\| \, \|q - q'\|.$$

This fact seems to be folklore knowledge; we were not able to locate it in published literature.
A proof using coupling is straightforward, and a non-coupling proof is given in [13].
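A small numerical sketch of the tensorization inequality may help fix the notation. Here product measures are represented as flattened outer products, and `tensorization_gap` (a name chosen for this example) returns the right-hand side of Lemma 2.2 minus the left-hand side, which should be nonnegative.

```python
import numpy as np

def tv(p, q):
    """Total variation distance with the factor 1/2 used in the text."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def random_distribution(m, rng):
    """Random probability vector of length m."""
    p = rng.random(m)
    return p / p.sum()

def tensorization_gap(p, pp, q, qp):
    """(a + b - a*b) - ||p x q - p' x q'||, with a = ||p - p'|| and b = ||q - q'||."""
    lhs = tv(np.outer(p, q).ravel(), np.outer(pp, qp).ravel())
    a, b = tv(p, pp), tv(q, qp)
    return (a + b - a * b) - lhs

rng = np.random.default_rng(0)
gaps = [
    tensorization_gap(
        random_distribution(3, rng), random_distribution(3, rng),
        random_distribution(4, rng), random_distribution(4, rng),
    )
    for _ in range(500)
]
```

The bound is attained, for example, when $p$ and $p'$ have disjoint supports: then $\|p - p'\| = 1$ and both sides equal 1.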

3 Markov chains 

3.1 Directed 

Technically, this section might be considered superfluous, since this result has already appeared in 
[15], and is strictly generalized in later sections. However, we find it instructive to work out the 
simple Markov case as it provides the cleanest illustration of our technique. 

Let $\mu$ be an inhomogeneous Markov measure on $\Omega^n$, induced by the kernels $p_0$ and $p_i(\cdot \mid \cdot)$,
$1 \le i < n$. Thus,

$$\mu(x) = p_0(x_1) \prod_{i=1}^{n-1} p_i(x_{i+1} \mid x_i).$$

The $i$-th contraction coefficient, $\theta_i$, is defined as in (6). As stated in the Introduction, Markov
contraction provides an estimate on the $\eta$-mixing coefficients:

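Theorem.

$$\bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1}.$$

Proof. Fix $1 \le i < j \le n$ and $y_1^{i-1} \in \Omega^{i-1}$, $w_i, w_i' \in \Omega$. Then

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \pi(x_j^n) \, |\zeta(x_j)|$$

where

$$\pi(u_k^\ell) = \prod_{t=k}^{\ell-1} p_t(u_{t+1} \mid u_t)$$

and

$$\zeta(x_j) = \begin{cases} \sum_{z_{i+1}^{j-1}} p_{j-1}(x_j \mid z_{j-1}) \, \pi(z_{i+1}^{j-1}) \left( p_i(z_{i+1} \mid w_i) - p_i(z_{i+1} \mid w_i') \right), & j - i > 1, \\ p_i(x_j \mid w_i) - p_i(x_j \mid w_i'), & j - i = 1. \end{cases} \tag{8}$$

Define $h \in \mathbb{R}^\Omega$ by $h_v = p_i(v \mid w_i) - p_i(v \mid w_i')$ and $P^{(k)} \in \mathbb{R}^{\Omega \times \Omega}$ by $P^{(k)}_{u,v} = p_k(u \mid v)$. Likewise,
define $z \in \mathbb{R}^\Omega$ by $z_v = \zeta(v)$. It follows that

$$z = P^{(j-1)} P^{(j-2)} \cdots P^{(i+2)} P^{(i+1)} h.$$

Therefore,

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \pi(x_j^n) \, |z_{x_j}| = \tfrac{1}{2} \sum_{x_j} |z_{x_j}| \sum_{x_{j+1}^n} \pi(x_j^n) = \tfrac{1}{2} \sum_{x_j} |z_{x_j}| = \|z\|.$$

The claim follows by (repeated applications of) the Markov contraction lemma, since $h$ is balanced
and $\|h\| \le \theta_i$. □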

The reader may wish to compare this proof with Samson's [29]. 
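For very small chains the coefficients $\bar\eta_{ij}$ can be computed by brute force from the joint distribution and compared against the product bound of the theorem. The sketch below does this by enumerating all sequences; the helper names and the convention `kernels[k][x_next, x_prev]` for $p_{k+1}$ are assumptions of this example, and the enumeration is exponential in $n$, so it is only meant as a check on toy instances.

```python
import itertools
import numpy as np

def contraction_coefficient(P):
    """Maximum TV distance between two columns of P, where P[x, y] = p(x | y)."""
    m = P.shape[1]
    return max(0.5 * np.abs(P[:, a] - P[:, b]).sum()
               for a in range(m) for b in range(m))

def chain_joint(p0, kernels):
    """Joint table mu[x_1, ..., x_n]; kernels[k][x_next, x_prev] plays the role of p_{k+1}."""
    m, n = len(p0), len(kernels) + 1
    mu = np.zeros((m,) * n)
    for x in itertools.product(range(m), repeat=n):
        prob = p0[x[0]]
        for k, P in enumerate(kernels):
            prob *= P[x[k + 1], x[k]]
        mu[x] = prob
    return mu

def eta_bar(mu, i, j):
    """Brute-force eta-bar_ij (1-based indices, i < j) from the joint table."""
    m = mu.shape[0]
    gap = tuple(range(i, j - 1))          # array axes of X_{i+1}, ..., X_{j-1}
    marg = mu.sum(axis=gap) if gap else mu
    marg = marg.reshape(m ** i, -1)       # rows: prefixes x_1^i, columns: suffixes x_j^n
    best = 0.0
    for y in range(m ** (i - 1)):
        rows = [marg[y * m + w] for w in range(m)]
        conds = [r / r.sum() for r in rows if r.sum() > 0]
        for a in conds:
            for b in conds:
                best = max(best, 0.5 * np.abs(a - b).sum())
    return best

def random_kernel(m, rng):
    """Random strictly positive column-stochastic matrix."""
    P = rng.random((m, m)) + 0.05
    return P / P.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
m, n = 2, 5
p0 = np.full(m, 1.0 / m)
kernels = [random_kernel(m, rng) for _ in range(n - 1)]
thetas = [contraction_coefficient(P) for P in kernels]
mu = chain_joint(p0, kernels)
```

With these objects, `eta_bar(mu, i, j)` can be compared with `np.prod(thetas[i - 1:j - 1])`, the 0-based slice corresponding to $\theta_i \cdots \theta_{j-1}$.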

3.2 Undirected 

In this section we analyze Markov chains under a different parametrization, in an "undirected
graphical model" setting [17]. For any graph $G = (V, E)$, where $|V| = n$ and the maximal cliques
have size 2 (i.e., are edges), we can define a measure on $\Omega^n$ as follows:

$$\mu(x) = \frac{\prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)}{\sum_{x' \in \Omega^n} \prod_{(i,j) \in E} \psi_{ij}(x'_i, x'_j)}$$

for some nonnegative "potential functions" $\psi_{ij}$.

Consider the very simple case of chain graphs; any such measure is a Markov measure on $\Omega^n$.
We can relate the induced Markov transition kernel $p_i(\cdot \mid \cdot)$ to the random field measure $\mu$ as follows:

$$p_i(x \mid y) = \frac{\sum_{v_1^{i-1}} \sum_{z_{i+2}^n} \mu(v_1^{i-1}\, y\, x\, z_{i+2}^n)}{\sum_{x' \in \Omega} \sum_{v_1^{i-1}} \sum_{z_{i+2}^n} \mu(v_1^{i-1}\, y\, x'\, z_{i+2}^n)}.$$

Our goal is to bound the $i$-th contraction coefficient $\theta_i$ of the Markov chain in terms of $\psi_{ij}$. We
claim a simple relationship between $\theta_i$ and $\psi_{ij}$:

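Theorem 3.1.

$$\theta_i \le \frac{R_i - r_i}{R_i + r_i}, \qquad 1 \le i < n, \tag{9}$$

where

$$R_i = \max_{x, y \in \Omega} \psi_{i,i+1}(x, y)$$

and

$$r_i = \min_{x, y \in \Omega} \psi_{i,i+1}(x, y).$$

First we prove a simple lemma:

Lemma 3.2. Let $\alpha, \beta, \gamma \in \mathbb{R}_{++}^{k+1}$ and $r, R \in \mathbb{R}$ be such that $0 < r \le \alpha_i, \beta_i \le R$ for $1 \le i \le k + 1$.
Then

$$\frac{1}{2} \sum_{i=1}^{k+1} \left| \frac{\alpha_i \gamma_i}{\sum_{j=1}^{k+1} \alpha_j \gamma_j} - \frac{\beta_i \gamma_i}{\sum_{j=1}^{k+1} \beta_j \gamma_j} \right| \le \frac{R - r}{R + r}. \tag{10}$$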



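Proof. When $p, q \in \mathbb{R}_+^{k+1}$ are two distributions satisfying $0 < r \le p_i, q_i$, it is straightforward to
verify that $\|p - q\|_1$ may be maximized, with value $d$, by choosing $a \in [r, (1 - d)/k]$, $b = a + d/k$
and setting $p_i = a$, $q_i = b$ for $1 \le i \le k$ and $p_{k+1} = 1 - ka$, $q_{k+1} = 1 - kb$. Applying this principle
to (10), we obtain

$$\sum_{i=1}^{k+1} \left| \frac{\alpha_i \gamma_i}{\sum_{j} \alpha_j \gamma_j} - \frac{\beta_i \gamma_i}{\sum_{j} \beta_j \gamma_j} \right| \le \frac{gkR - g'r}{gkR + g'r} + \frac{g'R - gkr}{g'R + gkr} = \frac{2g''k(R^2 - r^2)}{(R + g''kr)(g''kR + r)}$$

where $g = \frac{1}{k}\sum_{i=1}^{k} \gamma_i$, $g' = \gamma_{k+1}$ and $g'' = g/g'$.
Define $f : \mathbb{R}_+ \to \mathbb{R}$ by

$$f(x) = \frac{2(R^2 - r^2)x}{(R + rx)(Rx + r)};$$

elementary calculus verifies that $f$ is maximized at $x = 1$, where $f(1) = 2(R - r)/(R + r)$. □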

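Proof of Theorem 3.1. Let us define the shorthand notation:

$$\pi(u_k^\ell) = \prod_{t=k}^{\ell-1} \psi_{t,t+1}(u_t, u_{t+1}).$$

Then we expand

$$p_i(x \mid y) = \frac{\sum_{v_1^{i-1}} \sum_{z_{i+2}^n} \pi(v_1^{i-1}) \, \psi_{i-1,i}(v_{i-1}, y) \, \psi_{i,i+1}(y, x) \, \psi_{i+1,i+2}(x, z_{i+2}) \, \pi(z_{i+2}^n)}{\sum_{x' \in \Omega} \sum_{v_1^{i-1}} \sum_{z_{i+2}^n} \pi(v_1^{i-1}) \, \psi_{i-1,i}(v_{i-1}, y) \, \psi_{i,i+1}(y, x') \, \psi_{i+1,i+2}(x', z_{i+2}) \, \pi(z_{i+2}^n)} = \frac{\psi_{i,i+1}(y, x) \, a_{yx}}{\sum_{x' \in \Omega} \psi_{i,i+1}(y, x') \, a_{yx'}}$$

where

$$a_{yx} = \sum_{v_1^{i-1}} \sum_{z_{i+2}^n} \pi(v_1^{i-1}) \, \psi_{i-1,i}(v_{i-1}, y) \, \psi_{i+1,i+2}(x, z_{i+2}) \, \pi(z_{i+2}^n)$$

(we take the natural convention that $\psi_{ij}(\cdot, \cdot) = 1$ whenever $(i,j) \notin E$).
Fix $y, y' \in \Omega$. Define the quantities, for each $x \in \Omega$:

$$\alpha_x = \psi_{i,i+1}(y, x), \qquad \beta_x = \psi_{i,i+1}(y', x), \qquad \gamma_x = a_{yx}, \qquad \gamma'_x = a_{y'x}.$$

Then

$$\frac{1}{2} \sum_{x \in \Omega} \left| p_i(x \mid y) - p_i(x \mid y') \right| = \frac{1}{2} \sum_{x \in \Omega} \left| \frac{\alpha_x \gamma_x}{\sum_{x' \in \Omega} \alpha_{x'} \gamma_{x'}} - \frac{\beta_x \gamma'_x}{\sum_{x' \in \Omega} \beta_{x'} \gamma'_{x'}} \right| \tag{11}$$

$$= \frac{1}{2} \sum_{x \in \Omega} \left| \frac{\alpha_x \gamma_x}{\sum_{x' \in \Omega} \alpha_{x'} \gamma_{x'}} - \frac{\beta_x \gamma_x}{\sum_{x' \in \Omega} \beta_{x'} \gamma_{x'}} \right|; \tag{12}$$

the last equality follows since $\gamma'_x = c\gamma_x$, where

$$c = \frac{\sum_{v_1^{i-1}} \pi(v_1^{i-1}) \, \psi_{i-1,i}(v_{i-1}, y')}{\sum_{v_1^{i-1}} \pi(v_1^{i-1}) \, \psi_{i-1,i}(v_{i-1}, y)}$$

does not depend on $x$. Now Lemma 3.2 can be applied to (12) to establish the claim. □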
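The bound (9) can be checked numerically on a toy chain-graph model. The sketch below builds the Gibbs measure from randomly chosen strictly positive edge potentials by brute-force enumeration, extracts the induced kernel $p_i$ from the pairwise marginal of $(X_i, X_{i+1})$, and compares $\theta_i$ with $(R_i - r_i)/(R_i + r_i)$. All names here (`gibbs_chain`, `induced_kernel`, the range of the potentials) are assumptions made for illustration.

```python
import itertools
import numpy as np

def gibbs_chain(psis):
    """Normalized joint table mu[x_1, ..., x_n] for potentials psis[k][x_{k+1}, x_{k+2}]."""
    m, n = psis[0].shape[0], len(psis) + 1
    mu = np.zeros((m,) * n)
    for x in itertools.product(range(m), repeat=n):
        mu[x] = np.prod([psi[x[k], x[k + 1]] for k, psi in enumerate(psis)])
    return mu / mu.sum()

def induced_kernel(mu, i):
    """Column-stochastic P with P[x, y] = Pr(X_{i+1} = x | X_i = y), 1-based i."""
    others = tuple(k for k in range(mu.ndim) if k not in (i - 1, i))
    pair = mu.sum(axis=others)                 # pair[y, x]
    return (pair / pair.sum(axis=1, keepdims=True)).T

def contraction_coefficient(P):
    """Maximum TV distance between two columns of P."""
    m = P.shape[1]
    return max(0.5 * np.abs(P[:, a] - P[:, b]).sum()
               for a in range(m) for b in range(m))

rng = np.random.default_rng(2)
m, n = 3, 4
psis = [rng.uniform(0.5, 2.0, size=(m, m)) for _ in range(n - 1)]
mu = gibbs_chain(psis)

results = []
for i in range(1, n):
    theta_i = contraction_coefficient(induced_kernel(mu, i))
    R, r = psis[i - 1].max(), psis[i - 1].min()
    results.append((theta_i, (R - r) / (R + r)))
```

Each pair in `results` holds the computed $\theta_i$ and the corresponding right-hand side of (9). The potentials are kept strictly positive here so that $r_i > 0$ and every conditional distribution is well defined.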



4 Markov tree processes 

4.1 Preliminaries 

We begin by defining some notation specific to this section. A collection of variables may be
indexed by subset: if $x \in \Omega^V$ and $I \subset V$ with $I = \{i_1, i_2, \ldots, i_m\}$, then we write $x_I = x[I] =
(x_{i_1}, x_{i_2}, \ldots, x_{i_m})$; we will write $x_I$ and $x[I]$ interchangeably, as dictated by convenience. To avoid
cumbersome subscripts, we will also occasionally use the bracket notation for vector components.
Thus, if $u \in \mathbb{R}^{\Omega^I}$, then

$$u_{x_I} = u_{x[I]} = u[x_I] = u[x[I]] = u_{(x_{i_1}, x_{i_2}, \ldots, x_{i_m})}$$

for each $x[I] \in \Omega^I$. A similar bracket notation will apply for matrices. If $A$ is a matrix then
$A_{*,j} = A[*, j]$ will denote its $j$-th column. We will use $|\cdot|$ to denote set cardinalities, and write $[n]$
for the set $\{1, \ldots, n\}$. Probabilities are denoted by $P$ in this section.

If $G = (V, E)$ is a graph, we will frequently abuse notation and write $u \in G$ instead of $u \in V$,
blurring the distinction between a graph and its vertex set. This notation will carry over to
set-theoretic operations ($G = G_1 \cap G_2$) and indexing of variables (e.g., $X_G$).

4.2 Graph theory 

Consider a directed acyclic graph $G = (V, E)$, and define a partial order $\prec_G$ on $G$ by the transitive
closure of the relation

$$u \prec_G v \quad \text{if } (u, v) \in E.$$

We define the parents and children of $v \in V$ in the natural way:

$$\mathrm{parents}(v) = \{u \in V : (u, v) \in E\}$$

and

$$\mathrm{children}(v) = \{w \in V : (v, w) \in E\}.$$

If $G$ is connected and each $v \in V$ has at most one parent, $G$ is called a (directed) tree. In a
tree, whenever $u \prec_G v$ there is a unique directed path from $u$ to $v$. A tree $T$ always has a unique
minimal (with respect to $\prec_T$) element $r_0 \in V$, called its root. Thus, for every $v \in V$ there is a
unique directed path $r_0 \prec_T r_1 \prec_T \cdots \prec_T r_d = v$; define the depth of $v$, $\mathrm{dep}_T(v) = d$, to be the
length (i.e., number of edges) of this path. Note that $\mathrm{dep}_T(r_0) = 0$. We define the depth of the tree
by $\mathrm{dep}(T) = \sup_{v \in T} \mathrm{dep}_T(v)$.

For $d = 0, 1, \ldots$ define the $d$-th level of the tree $T$ by

$$\mathrm{lev}_T(d) = \{v \in V : \mathrm{dep}_T(v) = d\};$$

note that the levels induce a disjoint partition on $V$:

$$V = \bigcup_{d=0}^{\mathrm{dep}(T)} \mathrm{lev}_T(d).$$

We define the width² of a tree as the greatest number of nodes in any level:

$$\mathrm{wid}(T) = \sup_{0 \le d \le \mathrm{dep}(T)} |\mathrm{lev}_T(d)|. \tag{13}$$

We will consistently take $|V| = n$ for finite $V$. An ordering $J : V \to \mathbb{N}$ of the nodes is said to be
breadth-first if

$$\mathrm{dep}_T(u) < \mathrm{dep}_T(v) \implies J(u) < J(v). \tag{14}$$

Since every finite directed tree $T = (V, E)$ has some breadth-first ordering,³ we will henceforth blur
the distinction between $v \in V$ and $J(v)$, simply taking $V = [n]$ (or $V = \mathbb{N}$) and assuming that
$\mathrm{dep}_T(u) < \mathrm{dep}_T(v) \implies u < v$ holds. This will allow us to write $\Omega^V$ simply as $\Omega^n$ for any set $\Omega$.

Note that we have two orders on $V$: the partial order $\prec_T$, induced by the tree edges, and the
total order $<$, given by the breadth-first enumeration. Observe that $i \prec_T j$ implies $i < j$ but not
vice versa.

If $T = (V, E)$ is a tree and $u \in V$, we define the subtree induced by $u$, $T_u = (V_u, E_u)$, by
$V_u = \{v \in V : u \preceq_T v\}$, $E_u = \{(v, w) \in E : v, w \in V_u\}$.
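The graph-theoretic quantities above are straightforward to compute. The following sketch represents a tree by a parent map (an assumed encoding, with the root mapped to `None`) and computes depths, levels, width, the breadth-first property (14), and the vertex set of an induced subtree.

```python
from collections import deque

def depths(parent):
    """Depth of every node; parent[v] is the parent of v and the root maps to None."""
    children = {v: [] for v in parent}
    root = None
    for v, u in parent.items():
        if u is None:
            root = v
        else:
            children[u].append(v)
    dep = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children[u]:
            dep[v] = dep[u] + 1
            queue.append(v)
    return dep

def levels(parent):
    """Map from depth d to the list of nodes at that depth."""
    lev = {}
    for v, d in depths(parent).items():
        lev.setdefault(d, []).append(v)
    return lev

def width(parent):
    """Greatest number of nodes in any level."""
    return max(len(nodes) for nodes in levels(parent).values())

def is_breadth_first(parent):
    """Check condition (14) for the identity ordering J(v) = v."""
    dep = depths(parent)
    return all(u < v for u in dep for v in dep if dep[u] < dep[v])

def subtree_nodes(parent, u):
    """Vertex set V_u: all v whose path to the root passes through u (including u)."""
    nodes = set()
    for v in parent:
        w = v
        while w is not None and w != u:
            w = parent[w]
        if w == u:
            nodes.add(v)
    return nodes

# Hypothetical six-node tree, already numbered breadth-first.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
```

For this example the levels are {1}, {2, 3} and {4, 5, 6}, so the width is 3, and the subtree induced by node 2 has vertex set {2, 4, 5}.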

²This definition is nonstandard.

³One can easily construct a breadth-first ordering on a given tree by ordering the nodes arbitrarily within each
level and listing the levels in ascending order: $\mathrm{lev}_T(0), \mathrm{lev}_T(1), \ldots$.

4.3 Markov tree measure

If $\Omega$ is a finite set, a Markov tree measure $\mu$ is defined on $\Omega^n$ by a tree $T = (V, E)$ and transition
kernels $p_0$, $\{p_{ij} : (i,j) \in E\}$. Continuing our convention above, we have a breadth-first order $<$ and
the partial order $\prec_T$ on $V$, and take $V = \{1, \ldots, n\}$. Together, the edges of $T$ and the transition
kernels determine the distribution $\mu$ on $\Omega^n$:

$$\mu(x) = p_0(x_1) \prod_{(i,j) \in E} p_{ij}(x_j \mid x_i), \qquad x \in \Omega^n. \tag{15}$$

A measure on $\Omega^n$ satisfying (15) for some $T$ and $\{p_{ij}\}$ is said to be compatible with tree $T$; a measure
is a Markov tree measure if it is compatible with some tree.

Suppose $\Omega$ is a finite set and $(X_i)_{i \in \mathbb{N}}$, $X_i \in \Omega$, is a random process defined on $(\Omega^{\mathbb{N}}, P)$. If for
each $n > 0$ there is a tree $T^{(n)} = ([n], E^{(n)})$ and a Markov tree measure $\mu_n$ compatible with $T^{(n)}$
such that for all $x \in \Omega^n$ we have

$$P\{X_1^n = x\} = \mu_n(x)$$

then we call $X$ a Markov tree process. The trees $\{T^{(n)}\}$ are easily seen to be consistent in the sense
that $T^{(n)}$ is an induced subgraph of $T^{(n+1)}$. So corresponding to any Markov tree process is the
unique infinite tree $T = (\mathbb{N}, E)$. The uniqueness of $T$ is easy to see, since for $v > 1$, the parent of $v$
is the smallest $u \in \mathbb{N}$ such that

$$P\{X_v = x_v \mid X_1^u = x_1^u\} = P\{X_v = x_v \mid X_u = x_u\};$$

thus $P$ determines the edges of $T$.

It is straightforward to verify that a Markov tree process $(X_v)_{v \in T}$ compatible with tree $T$ has
the following Markov property: if $v$ and $v'$ are children of $u$ in $T$, then

$$P\{X_{T_v} = x, X_{T_{v'}} = x' \mid X_u = y\} = P\{X_{T_v} = x \mid X_u = y\} \, P\{X_{T_{v'}} = x' \mid X_u = y\}.$$

In other words, the subtrees induced by the children are conditionally independent given the parent;
this follows directly from the definition of the Markov tree measure in (15).

4.4 Statement of result 

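Theorem 4.1. Let $\Omega$ be a finite set and let $(X_i)_{1 \le i \le n}$, $X_i \in \Omega$, be a Markov tree process, defined
by a tree $T = (V, E)$ and transition kernels $p_0$, $\{p_{uv}(\cdot \mid \cdot)\}_{(u,v) \in E}$. Define the $(u,v)$-contraction
coefficient $\theta_{uv}$ by

$$\theta_{uv} = \max_{y, y' \in \Omega} \left\| p_{uv}(\cdot \mid y) - p_{uv}(\cdot \mid y') \right\|. \tag{16}$$

Suppose $\max_{(u,v) \in E} \theta_{uv} \le \theta < 1$ for some $\theta$ and $\mathrm{wid}(T) \le L$. Then for the Markov tree process $X$
we have

$$\bar\eta_{ij} \le \left(1 - (1 - \theta)^L\right)^{\lfloor (j-i)/L \rfloor} \tag{17}$$

for $1 \le i < j \le n$.

To cast (17) in more usable form, we first note that for $k, L \in \mathbb{N}$ with $k \ge L$, we have

$$\left\lfloor \frac{k}{L} \right\rfloor \ge \frac{k}{2L - 1} \tag{18}$$

(we omit the elementary number-theoretic proof). Using (18), we have

$$\bar\eta_{ij} \le \tilde\theta^{\,j-i}, \qquad \text{for } j \ge i + L, \tag{19}$$

where

$$\tilde\theta = \left(1 - (1 - \theta)^L\right)^{1/(2L-1)};$$

this implies the dimension-free bound

$$\|\Delta\|_\infty \le L - 1 + (1 - \tilde\theta)^{-1}.$$

In the (degenerate) case where the Markov tree is a chain, we have $L = 1$ and therefore $\tilde\theta = \theta$;
thus we recover Theorem 3.1.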
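To get a feel for how the bounds (17) and (19) behave as the width $L$ grows, they can be evaluated directly. The helper names below are assumptions of this sketch; the formulas are those displayed above.

```python
def eta_tree_bound(theta, L, i, j):
    """Right-hand side of (17): (1 - (1 - theta)^L) ** floor((j - i) / L)."""
    return (1.0 - (1.0 - theta) ** L) ** ((j - i) // L)

def theta_tilde(theta, L):
    """Effective per-step contraction appearing in (19)."""
    return (1.0 - (1.0 - theta) ** L) ** (1.0 / (2 * L - 1))

def delta_norm_bound(theta, L):
    """Dimension-free bound L - 1 + 1 / (1 - theta_tilde) on the operator norm."""
    return L - 1 + 1.0 / (1.0 - theta_tilde(theta, L))

# Illustrative values: contraction 0.5 on every edge, width at most 2.
example = eta_tree_bound(0.5, 2, 1, 5)   # (1 - 0.25) ** 2 = 0.5625
chain_case = theta_tilde(0.3, 1)         # L = 1 recovers theta itself
```

Because $1 - (1-\theta)^L$ approaches 1 quickly as $L$ increases, the resulting norm bound deteriorates rapidly for wide trees, which is consistent with the remark that the chain case $L = 1$ is the cleanest.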

4.5 Proof of Theorem 4.1 

The proof of Theorem 4.1 is a combination of elementary graph theory and tensor algebra. We start
with a graph-theoretic lemma:

Lemma 4.2. Let $T = ([n], E)$ be a tree and fix $1 \le i < j \le n$. Suppose $(X_i)_{1 \le i \le n}$ is a Markov tree
process whose distribution $P$ on $\Omega^n$ is compatible with $T$ (in the sense of Section 4.3). Define the
set

$$T_i^j = T_i \cap \{j, j+1, \ldots, n\},$$

consisting of those nodes in the subtree $T_i$ whose breadth-first numbering does not precede $j$. Then,
for $y \in \Omega^{i-1}$ and $w, w' \in \Omega$, we have

$$\eta_{ij}(y, w, w') = \begin{cases} 0, & T_i^j = \emptyset, \\ \eta_{i j_0}(y, w, w'), & \text{otherwise}, \end{cases} \tag{20}$$

where $j_0$ is the minimum (with respect to $<$) element of $T_i^j$.

Remark 4.3. This lemma tells us that when computing $\eta_{ij}$ it is sufficient to restrict our attention
to the subtree induced by $i$.

Proof. The case $j \in T_i$ implies $j_0 = j$ and is trivial; thus we assume $j \notin T_i$. In this case, the subtrees
$T_i$ and $T_j$ are disjoint. Putting $\bar{T}_i = T_i \setminus \{i\}$, we have by the Markov property,

$$P\{X_{\bar{T}_i} = x_{\bar{T}_i}, X_{T_j} = x_{T_j} \mid X_1^i = yw\} = P\{X_{\bar{T}_i} = x_{\bar{T}_i} \mid X_i = w\} \, P\{X_{T_j} = x_{T_j} \mid X_1^{i-1} = y\}.$$

Then from the definition of $\eta_{ij}$ and by marginalizing out the $X_{T_j}$, we have

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \left| P\{X_j^n = x_j^n \mid X_1^i = yw\} - P\{X_j^n = x_j^n \mid X_1^i = yw'\} \right|$$

$$= \tfrac{1}{2} \sum_{x_{T_i^j}} \left| P\{X_{T_i^j} = x_{T_i^j} \mid X_i = w\} - P\{X_{T_i^j} = x_{T_i^j} \mid X_i = w'\} \right|.$$

If $T_i^j = \emptyset$ then obviously $\eta_{ij} = 0$; otherwise, $\eta_{ij} = \eta_{i j_0}$, since $j_0$ is the "first" element of $T_i^j$. □
Next we develop some basic results for tensor norms. If $A$ is an $M \times N$ column-stochastic matrix
(i.e., $A_{ij} \ge 0$ for $1 \le i \le M$, $1 \le j \le N$ and $\sum_{i=1}^{M} A_{ij} = 1$ for all $1 \le j \le N$) and $u \in \mathbb{R}^N$ is
balanced in the sense that $\sum_{j=1}^{N} u_j = 0$, we have, by Lemma 2.1,

$$\|Au\| \le \|A\| \, \|u\|, \tag{21}$$

where

$$\|A\| = \max_{1 \le j, j' \le N} \left\| A_{*,j} - A_{*,j'} \right\| \tag{22}$$

and $A_{*,j} = A[*, j]$ denotes the $j$-th column of $A$. An immediate consequence of (21) is that $\|\cdot\|$
satisfies

$$\|AB\| \le \|A\| \, \|B\| \tag{23}$$

for column-stochastic matrices $A \in \mathbb{R}^{M \times N}$ and $B \in \mathbb{R}^{N \times P}$.

Remark 4.4. Note that if $A$ is a column-stochastic matrix then $\|A\| \le 1$, and if additionally $u$ is
balanced then $Au$ is also balanced.

If $u \in \mathbb{R}^M$ and $v \in \mathbb{R}^N$, define their tensor product $w = v \otimes u$ by

$$w_{(i,j)} = u_i v_j,$$

where the notation $(v \otimes u)_{(i,j)}$ is used to distinguish the 2-tensor $w$ from an $M \times N$ matrix. The
tensor $w$ is a vector in $\mathbb{R}^{MN}$ indexed by pairs $(i,j) \in [M] \times [N]$; its norm is naturally defined to be

$$\|w\| = \tfrac{1}{2}\|w\|_1 = \tfrac{1}{2} \sum_{(i,j) \in [M] \times [N]} |w_{(i,j)}|. \tag{24}$$

{i,j)e[M]x[N] 



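To develop a convenient tensor notation, we will fix the index set $V = \{1, \ldots, n\}$. For $I \subset V$,
a tensor indexed by $I$ is a vector $u \in \mathbb{R}^{\Omega^I}$. A special case of such an $I$-tensor is the product
$u = \bigotimes_{i \in I} v^{(i)}$, where $v^{(i)} \in \mathbb{R}^\Omega$ and

$$u[x_I] = \prod_{i \in I} v^{(i)}[x_i]$$

for each $x_I \in \Omega^I$. To gain more familiarity with the notation, let us write the total variation norm
of an $I$-tensor:

$$\|u\| = \tfrac{1}{2}\|u\|_1 = \tfrac{1}{2} \sum_{x_I \in \Omega^I} |u[x_I]|. \tag{25}$$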

In order to extend Lemma 2.2 to product tensors, we will need to define the function $\alpha_k : \mathbb{R}^k \to \mathbb{R}$
and state some of its properties:

Lemma 4.5. Define $\alpha_k : \mathbb{R}^k \to \mathbb{R}$ recursively as $\alpha_1(x) = x$ and

$$\alpha_{k+1}(x_1, x_2, \ldots, x_{k+1}) = x_{k+1} + (1 - x_{k+1}) \, \alpha_k(x_1, x_2, \ldots, x_k). \tag{26}$$

Then

(a) $\alpha_k$ is symmetric in its $k$ arguments, so it is well-defined as a mapping

$$\alpha : \{x_i : 1 \le i \le k\} \mapsto \mathbb{R}$$

from finite real sets to the reals;

(b) $\alpha_k$ takes $[0,1]^k$ to $[0,1]$ and is monotonically increasing in each argument on $[0,1]^k$;

(c) if $B \subset C \subset [0,1]$ are finite sets then $\alpha(B) \le \alpha(C)$;

(d) $\alpha_k(x, x, \ldots, x) = 1 - (1 - x)^k$;

(e) if $B$ is finite and $1 \in B \subset [0,1]$ then $\alpha(B) = 1$;

(f) if $B \subset [0,1]$ is a finite set then $\alpha(B) \le \sum_{x \in B} x$.

Remark 4.6. In light of (a), we will use the notation $\alpha_k(x_1, x_2, \ldots, x_k)$ and $\alpha(\{x_i : 1 \le i \le k\})$
interchangeably, as dictated by convenience.

Proof. Claims (a), (b), (e), (f) are straightforward to verify from the recursive definition of $\alpha$ and
induction. Claim (c) follows from (b) since

$$\alpha_{k+1}(x_1, x_2, \ldots, x_k, 0) = \alpha_k(x_1, x_2, \ldots, x_k)$$

and (d) is easily derived from the binomial expansion of $(1 - x)^k$. □
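The recursion (26) is short enough to implement directly, which makes the listed properties easy to spot-check. The sketch below uses a fold over the argument list; the function name `alpha` and the list-based interface are choices made for this example. Rearranging (26) as $1 - \alpha_{k+1} = (1 - x_{k+1})(1 - \alpha_k)$ shows that the recursion unrolls to $1 - \prod_i (1 - x_i)$, which the code uses only as a cross-check.

```python
from functools import reduce
from math import prod

def alpha(xs):
    """alpha_k(x_1, ..., x_k) computed by the recursion (26); xs must be non-empty."""
    return reduce(lambda acc, x: x + (1.0 - x) * acc, xs[1:], xs[0])

def alpha_closed_form(xs):
    """Closed form 1 - prod(1 - x_i), obtained by unrolling the recursion."""
    return 1.0 - prod(1.0 - x for x in xs)

two_args = alpha([0.5, 0.5])        # 0.5 + 0.5 * 0.5 = 0.75
with_one = alpha([0.2, 1.0, 0.7])   # any argument equal to 1 forces the value 1
```

The closed form makes symmetry (a) and the identity (d) immediate, and property (f) reflects the familiar union-bound inequality $1 - \prod_i (1 - x_i) \le \sum_i x_i$ on $[0,1]$.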
The function $\alpha_k$ is the natural generalization of $\alpha_2(x_1, x_2) = x_1 + x_2 - x_1 x_2$ to $k$ variables, and
it is what we need for the analog of Lemma 2.2 for a product of $k$ tensors:

Corollary 4.7. Let $\{u^{(i)}\}_{i \in I}$ and $\{v^{(i)}\}_{i \in I}$ be two sets of tensors and assume that each of $u^{(i)}, v^{(i)}$
is a probability measure on $\Omega$. Then we have

$$\left\| \bigotimes_{i \in I} u^{(i)} - \bigotimes_{i \in I} v^{(i)} \right\| \le \alpha\left\{ \left\| u^{(i)} - v^{(i)} \right\| : i \in I \right\}. \tag{27}$$

Proof. Pick an $i_0 \in I$ and let $p = u^{(i_0)}$, $p' = v^{(i_0)}$,

$$q = \bigotimes_{i \in I \setminus \{i_0\}} u^{(i)}, \qquad q' = \bigotimes_{i \in I \setminus \{i_0\}} v^{(i)}.$$

Apply Lemma 2.2 to $\|p \otimes q - p' \otimes q'\|$ and proceed by induction. □
Our final generalization concerns linear operators over $I$-tensors. For $I, J \subset V$, an $I,J$-matrix
$A$ has dimensions $|\Omega^J| \times |\Omega^I|$ and takes an $I$-tensor $u$ to a $J$-tensor $v$: for each $y_J \in \Omega^J$, we have

$$v[y_J] = \sum_{x_I \in \Omega^I} A[y_J, x_I] \, u[x_I], \tag{28}$$

which we write as $Au = v$. If $A$ is an $I,J$-matrix and $B$ is a $J,K$-matrix, the matrix product $BA$
is defined analogously to (28).

As a special case, an $I,J$-matrix might factorize as a tensor product of $|\Omega| \times |\Omega|$ matrices
$A^{(i,j)} \in \mathbb{R}^{\Omega \times \Omega}$. We will write such a factorization in terms of a bipartite graph⁴ $G = (I + J, E)$,
where $E \subset I \times J$ and the factors $A^{(i,j)}$ are indexed by $(i,j) \in E$:

$$A = \bigotimes_{(i,j) \in E} A^{(i,j)}, \tag{29}$$

where

$$A[y_J, x_I] = \prod_{(i,j) \in E} A^{(i,j)}_{y_j, x_i}$$

for all $x_I \in \Omega^I$ and $y_J \in \Omega^J$. The norm of an $I,J$-matrix is a natural generalization of the matrix
norm defined in (22):

$$\|A\| = \max_{x_I, x'_I \in \Omega^I} \left\| A[*, x_I] - A[*, x'_I] \right\| \tag{30}$$

where $u = A[*, x_I]$ is the $J$-tensor given by

$$u[y_J] = A[y_J, x_I];$$

(30) is well-defined via the tensor norm in (25). Since $I,J$-matrices act on $I$-tensors by ordinary
matrix multiplication, $\|Au\| \le \|A\| \, \|u\|$ continues to hold when $A$ is a column-stochastic $I,J$-matrix
and $u$ is a balanced $I$-tensor; if, additionally, $B$ is a column-stochastic $J,K$-matrix, $\|BA\| \le
\|B\| \, \|A\|$ also holds. Likewise, since another way of writing (29) is

$$A[*, x_I] = \bigotimes_{(i,j) \in E} A^{(i,j)}[*, x_i],$$

Corollary 4.7 extends to tensor products of matrices:

Lemma 4.8. Fix index sets $I, J$ and a bipartite graph $(I + J, E)$. Let $\{A^{(i,j)}\}_{(i,j) \in E}$ be a collection
of column-stochastic $|\Omega| \times |\Omega|$ matrices, whose tensor product is the $I,J$-matrix

$$A = \bigotimes_{(i,j) \in E} A^{(i,j)}.$$

Then

$$\|A\| \le \alpha\left\{ \left\| A^{(i,j)} \right\| : (i,j) \in E \right\}.$$

We are now in a position to state the main technical lemma, from which Theorem 4.1 will follow 
straightforwardly: 

Lemma 4.9. Let $\Omega$ be a finite set and let $(X_i)_{1 \le i \le n}$, $X_i \in \Omega$, be a Markov tree process, defined by
a tree $T = (V, E)$ and transition kernels $p_0$, $\{p_{uv}(\cdot \mid \cdot)\}_{(u,v) \in E}$. Let the $(u,v)$-contraction coefficient
$\theta_{uv}$ be as defined in (16). Fix $1 \le i < j \le n$ and let $j_0 = j_0(i,j)$ be as defined in Lemma 4.2 (we are
assuming its existence, for otherwise $\bar\eta_{ij} = 0$). Then we have

$$\bar\eta_{ij} \le \prod_{d = \mathrm{dep}_T(i) + 1}^{\mathrm{dep}_T(j_0)} \alpha\left\{ \theta_{uv} : v \in \mathrm{lev}_T(d) \right\}. \tag{31}$$

⁴Our notation for bipartite graphs is standard; it is equivalent to $G = (I \cup J, E)$ where $I$ and $J$ are always assumed
to be disjoint.
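Proof. For $y \in \Omega^{i-1}$ and $w, w' \in \Omega$, we have

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_j^n} \left| P\{X_j^n = x_j^n \mid X_1^i = yw\} - P\{X_j^n = x_j^n \mid X_1^i = yw'\} \right| \tag{32}$$

$$= \tfrac{1}{2} \sum_{x_j^n} \left| \sum_{z_{i+1}^{j-1}} \left( P\{X_{i+1}^n = z_{i+1}^{j-1} x_j^n \mid X_1^i = yw\} - P\{X_{i+1}^n = z_{i+1}^{j-1} x_j^n \mid X_1^i = yw'\} \right) \right|. \tag{33}$$

Let $T_i$ be the subtree induced by $i$ and

$$Z = T_i \cap \{i+1, \ldots, j_0 - 1\} \quad \text{and} \quad C = \{v \in T_i : (u,v) \in E,\ u < j_0,\ v \ge j_0\}. \tag{34}$$

Then by Lemma 4.2 and the Markov property, we get

$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x[C]} \left| \sum_{x[Z]} \left( P\{X[C \cup Z] = x[C \cup Z] \mid X_i = w\} - P\{X[C \cup Z] = x[C \cup Z] \mid X_i = w'\} \right) \right| \tag{35}$$

(the sum indexed by $\{j_0, \ldots, n\} \setminus C$ marginalizes out).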

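Define $D = \{d_k : k = 0, \ldots, |D|\}$ with $d_0 = \mathrm{dep}_T(i)$, $d_{|D|} = \mathrm{dep}_T(j_0)$ and $d_{k+1} = d_k + 1$ for
$0 \le k < |D|$. For $d \in D$, let $I_d = T_i \cap \mathrm{lev}_T(d)$ and $G_d = (I_{d-1} + I_d, E_d)$ be the bipartite graph
consisting of the nodes in $I_{d-1}$ and $I_d$, and the edges in $E$ joining them (note that $I_{d_0} = \{i\}$).

For $(u,v) \in E$, let $A^{(u,v)}$ be the $|\Omega| \times |\Omega|$ matrix given by

$$A^{(u,v)}_{x,x'} = p_{uv}(x \mid x')$$

and note that

$$\left\| A^{(u,v)} \right\| = \theta_{uv}.$$

Then by the Markov property, for each $z[I_d] \in \Omega^{I_d}$ and $x[I_{d-1}] \in \Omega^{I_{d-1}}$, $d \in D \setminus \{d_0\}$, we have

$$P\{X_{I_d} = z_{I_d} \mid X_{I_{d-1}} = x_{I_{d-1}}\} = A^{(d)}[z_{I_d}, x_{I_{d-1}}],$$

where

$$A^{(d)} = \bigotimes_{(u,v) \in E_d} A^{(u,v)}.$$

Likewise, for $d \in D \setminus \{d_0\}$,

$$P\{X_{I_d} = x_{I_d} \mid X_i = w\} = \sum_{x'_{I_{d_1}}} \sum_{x''_{I_{d_2}}} \cdots \sum_{x^{(d-1)}_{I_{d-1}}} P\{X_{I_{d_1}} = x'_{I_{d_1}} \mid X_i = w\} \, P\{X_{I_{d_2}} = x''_{I_{d_2}} \mid X_{I_{d_1}} = x'_{I_{d_1}}\} \cdots P\{X_{I_d} = x_{I_d} \mid X_{I_{d-1}} = x^{(d-1)}_{I_{d-1}}\}. \tag{36}$$

Define the (balanced) $I_{d_1}$-tensor

$$h = A^{(d_1)}[*, w] - A^{(d_1)}[*, w'], \tag{37}$$

the $I_{d_{|D|}}$-tensor

$$f = A^{(d_{|D|})} A^{(d_{|D|-1})} \cdots A^{(d_2)} h, \tag{38}$$

and $C_0, C_1, Z_0 \subset \{1, \ldots, n\}$:

$$C_0 = C \cap I_{\mathrm{dep}_T(j_0)}, \qquad C_1 = C \setminus C_0, \qquad Z_0 = I_{\mathrm{dep}_T(j_0)} \setminus C_0, \tag{39}$$

where $C$ and $Z$ are defined in (34). For readability we will write $P(x_U \mid \cdot)$ instead of $P\{X_U = x_U \mid \cdot\}$
below; no ambiguity should arise. Combining (35) and (36), we have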



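$$\eta_{ij}(y, w, w') = \tfrac{1}{2} \sum_{x_C} \left| \sum_{x_Z} \left( P(x[C \cup Z] \mid X_i = w) - P(x[C \cup Z] \mid X_i = w') \right) \right| \tag{40}$$

$$= \tfrac{1}{2} \sum_{x_C} \left| \sum_{x_{Z_0}} P(x[C_1] \mid x[Z_0]) \, f[x_{C_0 \cup Z_0}] \right| \tag{41}$$

$$= \|Bf\| \tag{42}$$

where $B$ is the $|\Omega^{C_0 \cup C_1}| \times |\Omega^{C_0 \cup Z_0}|$ column-stochastic matrix given by

$$B[x_{C_0} \cup x_{C_1},\ x'_{C_0} \cup x_{Z_0}] = \mathbb{1}_{\{x_{C_0} = x'_{C_0}\}} \, P(x_{C_1} \mid x_{Z_0})$$

with the convention that $P(x_{C_1} \mid x_{Z_0}) = 1$ if either of $Z_0$ or $C_1$ is empty. The claim now follows by
reading off the results previously obtained:

$$\|Bf\| \le \|B\| \, \|f\| \qquad \text{Eq. (21)}$$

$$\le \|f\| \qquad \text{Remark 4.4}$$

$$\le \|h\| \prod_{k=2}^{|D|} \left\| A^{(d_k)} \right\| \qquad \text{Eqs. (23, 38)}$$

$$\le \prod_{k=1}^{|D|} \alpha\left\{ \left\| A^{(u,v)} \right\| : (u,v) \in E_{d_k} \right\} \qquad \text{Lemma 4.8.}$$

□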
Proof of Theorem 4.1. We will borrow the definitions from the proof of Lemma 4.9. To upper-bound
$\bar\eta_{ij}$ we first bound $\alpha\{\|A^{(u,v)}\| : (u,v) \in E_{d_k}\}$. Since

\[
|E_{d_k}| \le \operatorname{wid}(T) \le L
\]

(because every node in $I_{d_k}$ has exactly one parent in $I_{d_{k-1}}$) and

\[
\|A^{(u,v)}\| \le \theta < 1,
\]

we appeal to Lemma 4.5 to obtain

\[
\alpha\bigl\{ \|A^{(u,v)}\| : (u,v) \in E_{d_k} \bigr\} \le 1 - (1 - \theta)^L. \tag{43}
\]

Now we must lower-bound the quantity $h = \operatorname{dep}_T(j_0) - \operatorname{dep}_T(i)$. Since every level can have up to
$L$ nodes, we have

\[
j_0 - i \le hL
\]

and so $h \ge \lfloor (j_0 - i)/L \rfloor \ge \lfloor (j - i)/L \rfloor$. □
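The two ingredients of this proof can be checked numerically. The sketch below assumes they combine into a bound of the form $(1 - (1-\theta)^L)^{\lfloor (j-i)/L \rfloor}$, and it uses a Kronecker product of independent edge kernels as a stand-in for a tree level (an assumption; Lemma 4.5 itself is not restated here). The names `contraction`, `level_bound` and `tree_mixing_bound` are hypothetical.

```python
import numpy as np

def contraction(K):
    """Largest total variation distance between two columns of K."""
    n = K.shape[1]
    return max(0.5 * np.abs(K[:, a] - K[:, b]).sum()
               for a in range(n) for b in range(n))

def level_bound(theta, L):
    """Right-hand side of (43)."""
    return 1.0 - (1.0 - theta) ** L

def tree_mixing_bound(theta, L, i, j):
    """level_bound raised to floor((j - i) / L), the lower bound on h."""
    return level_bound(theta, L) ** ((j - i) // L)

rng = np.random.default_rng(0)
L = 3
edges = []
for _ in range(L):
    A = rng.random((2, 2))
    edges.append(A / A.sum(axis=0, keepdims=True))   # column-stochastic edge kernel

theta = max(contraction(A) for A in edges)           # common bound on the edge coefficients

# Level matrix for L edges with distinct parents (assumed case).
level = edges[0]
for A in edges[1:]:
    level = np.kron(level, A)

level_theta = contraction(level)
```

In this special case `level_theta` never exceeds `level_bound(theta, L)`, which follows from the standard inequality for total variation between product measures. For instance, `tree_mixing_bound(0.5, 2, 1, 5)` evaluates $(1 - 0.5^2)^2$.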

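The calculations in Lemma 4.9 yield considerably more information than the simple bound in
(17). For example, suppose the tree $T$ has levels $\{I_d : d = 0, 1, \ldots\}$ with the property that the levels
are growing at most linearly:

\[
|I_d| \le cd
\]

for some $c > 0$. Let $d_i = \operatorname{dep}_T(i)$, $d_j = \operatorname{dep}_T(j_0)$, and $h = d_j - d_i$. Then

\[
\begin{aligned}
j - i \le j_0 - i &\le c \sum_{k=d_i+1}^{d_j} k \\
&= \frac{c}{2} \bigl( d_j(d_j + 1) - d_i(d_i + 1) \bigr) \\
&< \frac{c}{2} \bigl( (d_j + 1)^2 - d_i \bigr) \\
&< \frac{c}{2} (d_i + h + 1)^2
\end{aligned}
\]

so

\[
h > \sqrt{2(j - i)/c} - d_i - 1,
\]

which yields the bound, via Lemma 4.5(f),

\[
\bar\eta_{ij} \le \prod_{k=1}^{h} \sum_{(u,v) \in E_k} \theta_{uv}.
\]

Let $\theta_k = \max\{\theta_{uv} : (u,v) \in E_k\}$; then if $ck\theta_k \le \beta$ holds for some $\beta \in \mathbb{R}$, this becomes

\[
\begin{aligned}
\bar\eta_{ij} &\le \prod_{k=1}^{h} (ck\theta_k) \\
&\le \prod_{k=1}^{\sqrt{2(j-i)/c} - d_i - 1} (ck\theta_k) && \text{(44)} \\
&\le \beta^{\sqrt{2(j-i)/c} - d_i - 1}. && \text{(45)}
\end{aligned}
\]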
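The arithmetic behind the lower bound on $h$ and the resulting bound (45) can be sanity-checked with a few lines of Python. This is only a sketch under the stated assumption $|I_d| \le cd$; the function names are hypothetical, and the extreme case of completely full levels is used for the check.

```python
import math

def depth_gap_lower_bound(i, j, c, d_i):
    """sqrt(2 (j - i) / c) - d_i - 1, the lower bound on h."""
    return math.sqrt(2.0 * (j - i) / c) - d_i - 1.0

def linear_level_bound(beta, i, j, c, d_i):
    """Right-hand side of (45); informative only when the exponent is positive."""
    return beta ** depth_gap_lower_bound(i, j, c, d_i)

def chain_holds(c, d_i, h):
    """Check the displayed chain when every level between d_i and d_j is full."""
    d_j = d_i + h
    gap = c * sum(range(d_i + 1, d_j + 1))                  # c * sum_{k=d_i+1}^{d_j} k
    closed = c * (d_j * (d_j + 1) - d_i * (d_i + 1)) / 2.0  # closed form of the sum
    upper = c * (d_i + h + 1) ** 2 / 2.0
    # With j - i equal to the largest possible gap, h must exceed the lower bound.
    recovered = math.sqrt(2.0 * gap / c) - d_i - 1.0
    return math.isclose(gap, closed) and closed < upper and recovered < h

exponent = depth_gap_lower_bound(1, 51, 1, 1)   # sqrt(100) - 2
bound = linear_level_bound(0.5, 1, 51, 1, 1)    # 0.5 ** exponent
```

With $c = 1$, $d_i = 1$ and $j - i = 50$ the exponent is $\sqrt{100} - 2 = 8$, so for $\beta = 1/2$ the bound is $2^{-8}$.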
This is a non-trivial bound for trees with linearly growing levels: recall that to bound $\|\Delta\|_\infty$, we
must bound the series

\[
\sum_{j=i+1}^{\infty} \bar\eta_{ij}.
\]

By the limit comparison test with the series $\sum_{j=1}^{\infty} 1/j^2$, we have that

\[
\sum_{j=i+1}^{\infty} \beta^{\sqrt{2(j-i)/c} - d_i - 1}
\]

converges for $\beta < 1$. Similar techniques may be applied when the level growth is bounded by other
slowly increasing functions. It is hoped that this method will be extended to obtain concentration
bounds for larger classes of directed acyclic graphical models.
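The convergence claim can be observed numerically. The sketch below (hypothetical names; the parameter values are arbitrary choices) evaluates partial sums of the series in terms of $m = j - i$, together with the ratio used in the limit comparison against $\sum 1/m^2$.

```python
import math

def term(beta, c, d_i, m):
    """beta ** (sqrt(2 m / c) - d_i - 1), the m-th term with m = j - i."""
    return beta ** (math.sqrt(2.0 * m / c) - d_i - 1.0)

def partial_sum(beta, c, d_i, terms):
    """Sum of the first `terms` terms of the series."""
    return sum(term(beta, c, d_i, m) for m in range(1, terms + 1))

def ratio_to_inverse_square(beta, c, d_i, m):
    """Term m divided by 1 / m**2; it tends to 0 as m grows when beta < 1."""
    return term(beta, c, d_i, m) * m ** 2

beta, c, d_i = 0.5, 1.0, 1
s_short = partial_sum(beta, c, d_i, 10_000)
s_long = partial_sum(beta, c, d_i, 20_000)
late_ratio = ratio_to_inverse_square(beta, c, d_i, 10_000)
```

Doubling the number of terms leaves the partial sum essentially unchanged, and the comparison ratio is already negligible at $m = 10^4$, consistent with convergence for $\beta < 1$.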



5 Markov marginal processes 

In this section, we define a random process that strictly generalizes Markov and hidden Markov 
chains. It was first defined in the author's forthcoming work with A. Brockwell [14], where it is 
termed a Markov marginal process and applied to the analysis of adaptive Markov Chain Monte 
Carlo algorithms. 

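Consider two finite sets, $\hat\Omega$ (the "hidden state" space) and $\Omega$ (the "observed state" space).

Let $\mu$ be a Markov measure on $(\hat\Omega \times \Omega)^n$ defined by the initial distribution $p_0$ and the kernels
$\{K_i(\cdot \mid \cdot)\}_{1 \le i < n}$:

\[
\mu\binom{\hat x}{x} = p_0\binom{\hat x_1}{x_1} \prod_{i=1}^{n-1} K_i\Bigl( \tbinom{\hat x_{i+1}}{x_{i+1}} \Bigm| \tbinom{\hat x_i}{x_i} \Bigr), \tag{46}
\]

where for readability we use the stacked notation $\binom{\hat x}{x}$ instead of the more standard $(\hat x, x)$ for
elements of $\hat\Omega \times \Omega$.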

A Markov marginal process (MMP) is a measure $\rho$ on $\Omega^n$ defined by

\[
\rho(x) = \sum_{\hat x \in \hat\Omega^n} \mu\binom{\hat x}{x}. \tag{47}
\]
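The marginalization in (47) can be carried out either by brute force or one time step at a time, in the spirit of expressing probabilities as matrix products. The sketch below is an illustration under assumed conventions, not code from [14]: a joint state $(\hat x, x)$ is encoded as the integer `hat_x * n_obs + x`, and `kernels[t][s_next, s]` holds the transition probability from joint state `s` to `s_next`. All names are hypothetical.

```python
import itertools
import numpy as np

def rho_forward(x, p0, kernels, n_obs):
    """rho(x), summing out the hidden states one step at a time."""
    obs_of = np.arange(p0.shape[0]) % n_obs       # observed part of each joint state
    alpha = p0 * (obs_of == x[0])                 # keep states consistent with x_1
    for t, K in enumerate(kernels):
        alpha = (K @ alpha) * (obs_of == x[t + 1])
    return alpha.sum()

def rho_brute_force(x, p0, kernels, n_obs):
    """rho(x) as the explicit sum over all hidden sequences, as in (47)."""
    n_hidden = p0.shape[0] // n_obs
    total = 0.0
    for hidden in itertools.product(range(n_hidden), repeat=len(x)):
        states = [h * n_obs + o for h, o in zip(hidden, x)]
        prob = p0[states[0]]
        for t, K in enumerate(kernels):
            prob *= K[states[t + 1], states[t]]
        total += prob
    return total

rng = np.random.default_rng(3)
n_hidden, n_obs, n = 2, 3, 4
S = n_hidden * n_obs
p0 = rng.random(S)
p0 /= p0.sum()
kernels = []
for _ in range(n - 1):
    K = rng.random((S, S))
    kernels.append(K / K.sum(axis=0, keepdims=True))   # column-stochastic kernel

sequences = list(itertools.product(range(n_obs), repeat=n))
forward = [rho_forward(x, p0, kernels, n_obs) for x in sequences]
brute = [rho_brute_force(x, p0, kernels, n_obs) for x in sequences]
```

Both routines compute the same measure; the forward version needs a number of matrix-vector products linear in $n$, while the brute-force sum grows exponentially with $n$.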



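Let us define the $i$th contraction coefficient $\theta_i$ of a MMP with kernels $\{K_i(\cdot \mid \cdot)\}_{1 \le i < n}$ by

\[
\theta_i = \max_{\hat x, \hat x' \in \hat\Omega}\ \max_{x, x' \in \Omega} \Bigl\| K_i\Bigl( \cdot \Bigm| \tbinom{\hat x}{x} \Bigr) - K_i\Bigl( \cdot \Bigm| \tbinom{\hat x'}{x'} \Bigr) \Bigr\|_{\mathrm{TV}}. \tag{48}
\]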



It is shown in [14] that the $\eta$-mixing coefficients of a MMP may be controlled by its contraction
coefficients:

Theorem 5.1. A MMP $\rho$ on $\Omega^n$, as defined above, satisfies

\[
\bar\eta_{ij} \le \theta_i \theta_{i+1} \cdots \theta_{j-1}.
\]
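Theorem 5.1 can be tested by brute force on a small example. The sketch below is an illustration only: it assumes total variation is half the $\ell_1$ distance, encodes a joint state $(\hat x, x)$ as `hat_x * n_obs + x`, and draws the kernels at random. All function names are hypothetical.

```python
import itertools
import math
import numpy as np

def random_mmp(n_hidden, n_obs, n, rng):
    """Random initial law p0 and column-stochastic kernels on joint states."""
    S = n_hidden * n_obs
    p0 = rng.random(S)
    p0 /= p0.sum()
    kernels = []
    for _ in range(n - 1):
        K = rng.random((S, S))
        kernels.append(K / K.sum(axis=0, keepdims=True))
    return p0, kernels

def observed_law(p0, kernels, n_obs):
    """rho as a dict: observed sequence -> probability, hidden states summed out."""
    n, S = len(kernels) + 1, p0.shape[0]
    rho = {}
    for states in itertools.product(range(S), repeat=n):
        prob = p0[states[0]]
        for t, K in enumerate(kernels):
            prob *= K[states[t + 1], states[t]]
        obs = tuple(s % n_obs for s in states)    # observed part of each joint state
        rho[obs] = rho.get(obs, 0.0) + prob
    return rho

def contraction(K):
    """Largest total variation distance between two columns of K, as in (48)."""
    S = K.shape[1]
    return max(0.5 * np.abs(K[:, a] - K[:, b]).sum()
               for a in range(S) for b in range(S))

def eta_bar(rho, n_obs, i, j):
    """Max over y, w, w' of the TV distance between the conditional laws of X_j..X_n."""
    def tail_law(prefix):
        tail, total = {}, 0.0
        for x, p in rho.items():
            if x[:i] == prefix:
                total += p
                tail[x[j - 1:]] = tail.get(x[j - 1:], 0.0) + p
        return {k: v / total for k, v in tail.items()}
    worst = 0.0
    for y in itertools.product(range(n_obs), repeat=i - 1):
        for w, w2 in itertools.combinations(range(n_obs), 2):
            a, b = tail_law(y + (w,)), tail_law(y + (w2,))
            worst = max(worst, 0.5 * sum(abs(a[k] - b[k]) for k in a))
    return worst

rng = np.random.default_rng(1)
n, n_hidden, n_obs = 4, 2, 2
p0, kernels = random_mmp(n_hidden, n_obs, n, rng)
rho = observed_law(p0, kernels, n_obs)
thetas = [contraction(K) for K in kernels]        # thetas[t - 1] plays the role of theta_t
report = {(i, j): (eta_bar(rho, n_obs, i, j), math.prod(thetas[i - 1:j - 1]))
          for i in range(1, n) for j in range(i + 1, n + 1)}
```

Each entry of `report` pairs the exact mixing coefficient with the product of contraction coefficients; in every pair the first number should not exceed the second.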
Note that this result subsumes the bound for Markov chains and hidden Markov chains [12, 15].
As [14] is still in preparation, we find it instructive to reproduce the proof here:

Proof. Fix an $n > 0$, $1 \le i < j \le n$, $\dot y_1^{i-1} \in \Omega^{i-1}$ and $\dot w_i, \dot w_i' \in \Omega$. We will use the generic
notation $P(x)$ and $P(x \mid y)$ for the probabilities induced by $\mu$ and $\rho$, consistently indicating observed
sequences by a dot ($\dot{\ }$) and hidden ones by a hat ($\hat{\ }$); no confusion should arise. We will occasionally
drop subscripts and superscripts for readability. In this proof, whenever $K_t(a \mid b)$ appears in an
expression where $t$ takes on the value 0, it is to be interpreted as $p_0(a)$. Empty products evaluate
to unity by convention. We expand



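\[
\begin{aligned}
\eta_{ij}(\dot y_1^{i-1}, \dot w_i, \dot w_i')
&= \frac12 \sum_{\dot x_j^n} \Bigl| P(\dot x_j^n \mid \dot y_1^{i-1} \dot w_i) - P(\dot x_j^n \mid \dot y_1^{i-1} \dot w_i') \Bigr| \\
&= \frac12 \sum_{\dot x_j^n} \Bigl| \sum_{\dot z_{i+1}^{j-1}} \bigl( P(\dot z_{i+1}^{j-1} \dot x_j^n \mid \dot y_1^{i-1} \dot w_i) - P(\dot z_{i+1}^{j-1} \dot x_j^n \mid \dot y_1^{i-1} \dot w_i') \bigr) \Bigr| \\
&= \frac12 \sum_{\dot x_j^n} \Bigl| \sum_{\dot z_{i+1}^{j-1}} \sum_{\hat y_1^{i-1}} \sum_{\hat z_{i+1}^{j-1}} \sum_{\hat x_j^n} \Bigl[ \sum_{\hat w_i} \mu\tbinom{\hat y \hat w \hat z \hat x}{\dot y \dot w \dot z \dot x} \Big/ P(\dot y \dot w) - \sum_{\hat w_i} \mu\tbinom{\hat y \hat w \hat z \hat x}{\dot y \dot w' \dot z \dot x} \Big/ P(\dot y \dot w') \Bigr] \Bigr|.
\end{aligned} \tag{49}
\]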



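To make the above shorthand quite explicit, let us elaborate:

\[
P(\dot z_{i+1}^{j-1} \dot x_j^n \mid \dot y_1^{i-1} \dot w_i) \equiv P\{ X_{i+1}^n = \dot z_{i+1}^{j-1} \dot x_j^n \mid X_1^i = \dot y_1^{i-1} \dot w_i \},
\]

\[
\mu\tbinom{\hat y \hat w \hat z \hat x}{\dot y \dot w \dot z \dot x} \equiv P\{ \hat X_1^n = \hat y_1^{i-1} \hat w_i \hat z_{i+1}^{j-1} \hat x_j^n,\ X_1^n = \dot y_1^{i-1} \dot w_i \dot z_{i+1}^{j-1} \dot x_j^n \}.
\]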
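Using the definitions in (46) and (47), we rewrite (49) in matrix notation as

\[
\begin{aligned}
\text{r.h.s. of (49)} &= \frac12 \sum_{\dot x_j^n} \Bigl| \sum_{\hat x_j^n} \prod_{t=j}^{n-1} K_t\Bigl( \tbinom{\hat x_{t+1}}{\dot x_{t+1}} \Bigm| \tbinom{\hat x_t}{\dot x_t} \Bigr)\, z_{(\hat x_j, \dot x_j)} \Bigr| \\
&\le \frac12 \sum_{\hat x_j, \dot x_j} \bigl| z_{(\hat x_j, \dot x_j)} \bigr| \sum_{\hat x_{j+1}^n} \sum_{\dot x_{j+1}^n} \prod_{t=j}^{n-1} K_t\Bigl( \tbinom{\hat x_{t+1}}{\dot x_{t+1}} \Bigm| \tbinom{\hat x_t}{\dot x_t} \Bigr),
\end{aligned} \tag{50}
\]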
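where $z \in \mathbb{R}^{\hat\Omega \times \Omega}$ is a vector given by

\[
z = K^{(j-1)} K^{(j-2)} \cdots K^{(i+1)} h
\]

with $h \in \mathbb{R}^{\hat\Omega \times \Omega}$ given by

\[
h_{(\hat u, \dot u)} = \sum_{\hat y_1^{i-1}} \prod_{t=0}^{i-2} K_t\Bigl( \tbinom{\hat y_{t+1}}{\dot y_{t+1}} \Bigm| \tbinom{\hat y_t}{\dot y_t} \Bigr)
\Biggl[ \sum_{\hat w_i} \frac{ K_{i-1}\Bigl( \tbinom{\hat w_i}{\dot w_i} \Bigm| \tbinom{\hat y_{i-1}}{\dot y_{i-1}} \Bigr) K_i\Bigl( \tbinom{\hat u}{\dot u} \Bigm| \tbinom{\hat w_i}{\dot w_i} \Bigr) }{ P(\dot y_1^{i-1} \dot w_i) }
- \sum_{\hat w_i} \frac{ K_{i-1}\Bigl( \tbinom{\hat w_i}{\dot w_i'} \Bigm| \tbinom{\hat y_{i-1}}{\dot y_{i-1}} \Bigr) K_i\Bigl( \tbinom{\hat u}{\dot u} \Bigm| \tbinom{\hat w_i}{\dot w_i'} \Bigr) }{ P(\dot y_1^{i-1} \dot w_i') } \Biggr] \tag{51}
\]

and $K^{(t)}$ is a $|\hat\Omega \times \Omega| \times |\hat\Omega \times \Omega|$ column-stochastic matrix given by

\[
\bigl[ K^{(t)} \bigr]_{(\hat u, \dot u), (\hat v, \dot v)} = K_t\Bigl( \tbinom{\hat u}{\dot u} \Bigm| \tbinom{\hat v}{\dot v} \Bigr).
\]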

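Let us bound $\|h\|$. Since

\[
P(\dot y_1^{i-1} \dot w_i) = \sum_{\hat y_1^{i-1}} \sum_{\hat w_i} K_{i-1}\Bigl( \tbinom{\hat w_i}{\dot w_i} \Bigm| \tbinom{\hat y_{i-1}}{\dot y_{i-1}} \Bigr) \prod_{t=0}^{i-2} K_t\Bigl( \tbinom{\hat y_{t+1}}{\dot y_{t+1}} \Bigm| \tbinom{\hat y_t}{\dot y_t} \Bigr)
\]

(and similarly for $P(\dot y_1^{i-1} \dot w_i')$) we have that $h$ is a difference of convex combinations of conditional
distributions:

\[
h_{(\hat u, \dot u)} = \sum_{\hat v \in \hat\Omega} \alpha_{\hat v}\, K_i\Bigl( \tbinom{\hat u}{\dot u} \Bigm| \tbinom{\hat v}{\dot w_i} \Bigr) - \sum_{\hat v \in \hat\Omega} \alpha'_{\hat v}\, K_i\Bigl( \tbinom{\hat u}{\dot u} \Bigm| \tbinom{\hat v}{\dot w_i'} \Bigr),
\]

where $\alpha, \alpha' \ge 0$ and $\sum \alpha = \sum \alpha' = 1$. Since the function $f(x,y) = \|x - y\|$ is convex in both
arguments, we have

\[
\|h\| \le \theta_i. \tag{52}
\]

The claim follows by applying the Markov contraction lemma to (52) and (51). □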
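The two facts used at the end of this proof are easy to check numerically: a difference of convex combinations of columns of a stochastic matrix has total variation norm at most the contraction coefficient, and a column-stochastic matrix contracts balanced vectors by at least its contraction coefficient. The sketch below uses random matrices and hypothetical names, and takes the norm to be half the $\ell_1$ norm (an assumption about the convention in use).

```python
import numpy as np

def contraction(K):
    """Largest total variation distance between two columns of K."""
    S = K.shape[1]
    return max(0.5 * np.abs(K[:, a] - K[:, b]).sum()
               for a in range(S) for b in range(S))

def tv_norm(v):
    """Half the l1 norm, i.e. total variation for balanced vectors."""
    return 0.5 * np.abs(v).sum()

def random_column_stochastic(S, rng):
    K = rng.random((S, S))
    return K / K.sum(axis=0, keepdims=True)

def random_weights(S, rng):
    a = rng.random(S)
    return a / a.sum()

rng = np.random.default_rng(2)
S = 4
K_i = random_column_stochastic(S, rng)
K_next = random_column_stochastic(S, rng)

# h: difference of two convex combinations of columns of K_i (a balanced vector).
h = K_i @ random_weights(S, rng) - K_i @ random_weights(S, rng)

# z: h pushed through one more kernel, as in z = K^(j-1) ... K^(i+1) h.
z = K_next @ h
```

Here `tv_norm(h)` is at most `contraction(K_i)` by convexity, and `tv_norm(z)` is at most `contraction(K_next) * tv_norm(h)` by the contraction property; iterating the second step gives the product of coefficients in Theorem 5.1.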